Model assumptions


Daniela Palleschi

Humboldt-Universität zu Berlin

2023-04-13

knitr::opts_chunk$set(eval = T, # evaluate code chunks? set 'eval = F' to skip re-running the analyses
                      echo = T, # print code chunks?
                      message = F, # print messages?
                      error = F, # print errors?
                      warning = F) # print warnings?

Learning objectives

Today we will learn…

  • how to check model assumptions
  • how to transform our data to meet model assumptions

Resources

Sections 6.3, 6.4 and 7.9 in Winter (2019)

Set-up

  1. Start with a clean R Environment (Session > Restart R).

  2. Suppress scientific notation (make very small numbers easier to read):

# suppress scientific notation
options(scipen=999)

Packages

pacman::p_load(tidyverse,
               broom,
               patchwork,
               sjPlot,
               knitr,
               kableExtra,
               Rmisc)

Data

  • force character variables to factors
  • filter for the verb region from critical items only, remove participant 3, and remove first-pass reading times of 0
# load in dataset
df_crit_verb <-
  readr::read_csv(
    here::here("data/tidy_data_lifetime_pilot.csv"),
    # for special characters
    locale = readr::locale(encoding = "latin1")
  ) |>
  mutate_if(is.character, as.factor) |> # all character variables as factor
  mutate(lifetime = fct_relevel(lifetime, "living", "dead"),
         tense = fct_relevel(tense, "PP", "SF")) |>
  filter(type == "critical", # only critical trials
         px != "px3", # px3 had a lot of missing values
         fp > 0, # only values of fp above 0
         region == "verb") %>% # critical region only
  droplevels() # remove any factor levels with no observations

Review: (multiple) linear regression

  • when we run a linear model, we are fitting a line (predicted values) to our data (observed values)
    • the intercept is the predicted value of y (our outcome) when x (our predictor) is 0
    • the slope is the predicted change in y with an increase of 1-unit of x
    • the residuals measure the difference between our predicted values of y from the observed values of y

\[ \begin{align} observed\;values &= fitted\;values + residuals\\ residuals &= observed\;values - fitted\;values \end{align} \]
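This identity can be verified directly in R; a minimal sketch with toy data:

```r
# toy illustration: residuals are the observed values minus the fitted values
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)
# observed values = fitted values + residuals holds exactly
all.equal(y, unname(fitted(fit) + resid(fit)))
```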

Our model

Check contrasts

contrasts(df_crit_verb$lifetime)
       dead
living    0
dead      1
contrasts(df_crit_verb$tense)
   SF
PP  0
SF  1

Re-order predictor levels

# set contrasts
df_crit_verb <- df_crit_verb %>% 
  mutate(lifetime = fct_relevel(lifetime, "living", "dead"),
         tense = fct_relevel(tense, "PP", "SF"))

Set sum coding

Set contrasts

contrasts(df_crit_verb$lifetime) <- c(-0.5, +0.5)
contrasts(df_crit_verb$tense) <- c(-0.5, +0.5)

Check contrasts

contrasts(df_crit_verb$lifetime)
       [,1]
living -0.5
dead    0.5
contrasts(df_crit_verb$tense)
   [,1]
PP -0.5
SF  0.5

Run model

# fit linear model
fit_fp <- df_crit_verb %>%
  lm(fp ~ lifetime*tense, data = .)

Assumptions

  • refer to residuals
    • i.e., the difference between the observed and the fitted (predicted) values
  1. normality assumption
    • residuals of the model are (approximately) normally distributed
  2. constant variance assumption (homoscedasticity)
    • spread of residuals should be (approximately) equal along the regression line
  3. absence of collinearity
    • predictors should not be correlated with each other
  4. independence
    • data points should be independent from each other

Checking assumptions

  • we typically assess normality and homoscedasticity visually with histograms, Q-Q plots, and residual plots
  • we can assess collinearity with VIFs
  • the assumption of independence is dealt with conceptually
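For a quick first pass, base R's `plot()` method for `lm` objects produces these standard diagnostics; a sketch using the built-in `cars` dataset as a stand-in:

```r
# base R diagnostics for a fitted lm: residuals vs fitted (constant variance)
# and a normal Q-Q plot of the residuals (normality)
fit <- lm(dist ~ speed, data = cars)  # built-in dataset as a stand-in
par(mfrow = c(1, 2))                  # two panels side by side
plot(fit, which = 1:2)                # 1 = residuals vs fitted, 2 = normal Q-Q
```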

Image source: Winter (2019) (all rights reserved)

Constant variance: Residual plot

  • doesn’t look like a blob with categorical data, but rather stripes
    • what’s important is that the distribution (vertically) is roughly similar across stripes
    • why are predicted values for categorical data stripey?

Normality assumption

  • can be inspected e.g., with a histogram or Q-Q plot
df_crit_verb |> 
  filter(fp > 0) |> 
  mutate(half = if_else(trial >= 104, "2nd", "1st")) |> 
  ggpubr::ggqqplot(x = "fp")

Normality assumption

  • how about by participant and experimental half?
df_crit_verb |> 
  filter(fp > 0) |> 
  mutate(half = if_else(trial >= 104, "2nd", "1st")) |> 
  ggpubr::ggqqplot(x = "fp",
                    color = "half",
                    facet.by = "px")

Normality assumption

  • the normal distribution is symmetrical; are our residuals normally distributed?

Figure 1: Visualising normality of residuals

Normality assumption

  • reading time data tends to be positively skewed
    • so the residuals also tend to be positively skewed
  • data with a skewed distribution do not meet the normality assumption
  • a fix: nonlinear transformations
    • the most common: the log transformation
  • log-transforming your data compresses large values much more than small ones
    • the differences between larger numbers shrink, pulling in the long right tail
    • this can make positively skewed data approximately normally distributed
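A quick simulation makes this concrete; a sketch using lognormal "reading times" as made-up data:

```r
set.seed(1)
rt <- rlnorm(1000, meanlog = 6, sdlog = 0.5)  # positively skewed, like raw RTs
# right skew pulls the mean above the median...
mean(rt) > median(rt)
# ...but after log-transforming, the mean and median nearly coincide (symmetry)
abs(mean(log(rt)) - median(log(rt))) < abs(mean(rt) - median(rt))
```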

Log transformation

  • for more see Section 5.4 in Winter (2019)

  • the R function log() computes the ‘natural logarithm’ (and is the inverse of the exponential exp())

  • log() makes large numbers smaller

  • exp() makes small numbers larger

log(2)
[1] 0.6931472
log(1:10)
 [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
 [8] 2.0794415 2.1972246 2.3025851
log(c(10,20,30,40,100))
[1] 2.302585 2.995732 3.401197 3.688879 4.605170
exp(1:10)
 [1]     2.718282     7.389056    20.085537    54.598150   148.413159
 [6]   403.428793  1096.633158  2980.957987  8103.083928 22026.465795
exp(log(1:10))
 [1]  1  2  3  4  5  6  7  8  9 10

Fit model (log-transformed)

  • continuous variables truncated at 0 typically have a positive skew
    • many small values (e.g., tt < 500ms), with some larger values (tt > 1000ms)
    • this usually means our residuals are also positively skewed, i.e., not normally distributed
  • so we typically log-transform raw reading/reaction times for our linear models
# fit simple linear model with log
fit_fp_log <- df_crit_verb %>%
  filter(fp > 0) %>% # important! you can't log transform 0
  lm(log(fp) ~ lifetime*tense, data = .)
summary(fit_fp_log)

Call:
lm(formula = log(fp) ~ lifetime * tense, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.18141 -0.34282  0.00058  0.26923  1.38822 

Coefficients:
                 Estimate Std. Error t value             Pr(>|t|)    
(Intercept)       5.63477    0.01877 300.193 < 0.0000000000000002 ***
lifetime1         0.10130    0.03754   2.698              0.00718 ** 
tense1           -0.03357    0.03754  -0.894              0.37158    
lifetime1:tense1 -0.08320    0.07508  -1.108              0.26828    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4374 on 539 degrees of freedom
Multiple R-squared:  0.01712,   Adjusted R-squared:  0.01165 
F-statistic: 3.129 on 3 and 539 DF,  p-value: 0.02538

Check assumptions

Raw first-pass reading times

plot(density(resid(fit_fp)))

Log first-pass reading times

plot(density(resid(fit_fp_log)))

Check assumptions

Raw first-pass reading times

qqnorm(residuals(fit_fp))
qqline(residuals(fit_fp), col="red")

Log first-pass reading times

qqnorm(residuals(fit_fp_log))
qqline(residuals(fit_fp_log), col="red")

Check assumptions: constant variance

Raw first-pass reading times

fit_fp %>% 
ggplot(aes(x = .fitted, y = .resid)) +
  labs(title = "Residual plot (fp ~ lifetime*tense)") +
  geom_point() +
  geom_hline(yintercept = 0) +
  theme_bw()  # Add theme for cleaner look

Log first-pass reading times

fit_fp_log %>% 
ggplot(aes(x = .fitted, y = .resid)) +
  labs(title = "Residual plot (log(fp) ~ lifetime*tense)") +
  geom_point() +
  geom_hline(yintercept = 0) +
  theme_bw()  # Add theme for cleaner look

Approaching normality: log

  • so it seems the log transformation has made the residuals more normally distributed

Interpreting log

augment(fit_fp_log, data = df_crit_verb[df_crit_verb$fp > 0,]) %>% 
  distinct(lifetime,tense,.fitted) %>% 
  arrange(lifetime) 
# A tibble: 4 × 3
  lifetime tense .fitted
  <fct>    <fct>   <dbl>
1 living   PP       5.58
2 living   SF       5.59
3 dead     PP       5.72
4 dead     SF       5.65
df_crit_verb %>% 
  mutate(predicted_raw = predict(fit_fp),
         predicted_log = predict(fit_fp_log)) %>% 
  distinct(lifetime,tense,predicted_raw, predicted_log) %>% 
  arrange(lifetime) 
# A tibble: 4 × 4
  lifetime tense predicted_raw predicted_log
  <fct>    <fct>         <dbl>         <dbl>
1 living   PP             294.          5.58
2 living   SF             292.          5.59
3 dead     PP             337.          5.72
4 dead     SF             313.          5.65
exp(coef(fit_fp_log)['(Intercept)'])
(Intercept) 
   279.9937 
coef(fit_fp)['(Intercept)']
(Intercept) 
   309.0606 
  • the exponential (back-transformation) of a log slope corresponds to a percentage change
(exp(coef(fit_fp_log)['lifetime1'])-1)*100
lifetime1 
 10.66132 
(exp(coef(fit_fp_log)['tense1'])-1)*100
   tense1 
-3.301406 
# living-SF
coef(fit_fp)['(Intercept)'] + (coef(fit_fp)['lifetime1'] * -0.5)
(Intercept) 
    293.302 
exp(coef(fit_fp_log)['(Intercept)']) * (exp(coef(fit_fp_log)['lifetime1']* -0.5))
(Intercept) 
   266.1646 
exp(coef(fit_fp_log)['(Intercept)']) * (exp(coef(fit_fp_log)['lifetime1']* +0.5))
(Intercept) 
   294.5413 
# living-SF
exp(coef(fit_fp_log)['(Intercept)']) * exp(coef(fit_fp_log)['lifetime1'] * -0.5 + coef(fit_fp_log)['tense1'] * -0.5 )
(Intercept) 
   270.6701 
# dead-PP
exp(coef(fit_fp_log)['(Intercept)']) * exp(coef(fit_fp_log)['lifetime1'] * +0.5 + coef(fit_fp_log)['tense1'] * -0.5 )
(Intercept) 
   299.5271 
# dead-SF
coef(fit_fp_log)['(Intercept)'] + coef(fit_fp_log)['lifetime1'] * 0.5 + coef(fit_fp_log)['tense1'] * 0.5 
(Intercept) 
   5.668634 

Collinearity

  • your predictors should not be correlated
    • this can be checked using the ‘variance inflation factors’ (VIFs), which measure the degree to which a predictor can be accounted for by other predictors
  • there are different guidelines for appropriate VIF cut-offs, with some considering a VIF > 10 to indicate collinearity
pacman::p_load(car)
vif(fit_fp)
      lifetime          tense lifetime:tense 
      1.000169       1.000169       1.000007 
  • importantly, sample size interacts with collinearity (the more data you have, the more precisely you can estimate effects even in the face of collinearity)
    • so, an alternative to dropping a predictor is to collect more data (if possible)
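The VIF can also be computed by hand, which makes its definition transparent: for predictor j, VIF = 1 / (1 − R²), where R² comes from regressing predictor j on the other predictors. A sketch on simulated (made-up) data:

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)  # x2 strongly correlated with x1
y  <- x1 + rnorm(n)
# VIF for x1: regress x1 on the other predictor(s), then 1 / (1 - R^2)
r2_x1  <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1  # far above 1, flagging collinearity
```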

Independence assumption

  • important in repeated measures design, where we collect multiple data points across conditions per participant
  • the importance of the independence assumption far outweighs that of the other assumptions
  • having data points that are linked, but not telling your model this information, can massively inflate your Type I error (probability of finding a false positive)

What is (in)dependence of data?

  • e.g., each roll of a die is independent
    • that is, if you roll a 6, this has no bearing on what you will roll next
    • if you roll a die 10 times and I roll a die 10 times, the values we get will not be linked to who rolled the die
  • reading times are dependent
    • if we collect eye-tracking reading times from two people, the data points from each person will be dependent
      • this is because each person will have their own reading pace, so their data points will tend to cluster
    • also true for experimental item
      • some items will likely elicit a stronger effect than others
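A toy simulation of two readers with different baseline speeds shows this clustering; the numbers are made up for illustration:

```r
set.seed(1)
fast <- rnorm(20, mean = 250, sd = 30)  # reader A: quick reading pace
slow <- rnorm(20, mean = 400, sd = 30)  # reader B: slower reading pace
rt   <- c(fast, slow)
px   <- rep(c("A", "B"), each = 20)
# the pooled data cluster by participant: each reader's observations sit
# near their own baseline, so the 40 data points are not independent
tapply(rt, px, mean)
```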

Check independence

  • independence is a conceptual consideration, it can’t be checked visually or numerically
  • it is a question of experimental design
  • a way to think about it: can my data be linked/clustered/grouped by some means?
    • e.g., yes, by participant, by item, even by verb
  • there of course are many ways our data might cluster, contributing random error that would not be expected to be repeated
    • we have specific predictions for the effects of lifetime or tense, and would expect them to replicate if we were to re-run the study
    • but we cannot predict how certain participants will pattern, nor how certain items will pattern
    • these are random variables, essentially error that we can try to explain by means of telling the model “check for effects by participant and item”

Accounting for independence

  • there is a very powerful tool that allows us to include dependence in our models: linear mixed models (LMMs)
  • LMMs are mixed because they include fixed effects (lifetime, tense), and random effects (in eye-tracking experiments, typically participant, item)
    • also called linear mixed effects models (LMEMs), or multilevel models (or hierarchical models in some cases)
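As a preview, a sketch of what such a model might look like for these data, assuming the lme4 package is installed and that the data contain an item identifier column (called `item` here; not shown in this dataset excerpt):

```r
# sketch only: random intercepts for participants (px) and items
# 'item' is an assumed column name, not confirmed in this excerpt
pacman::p_load(lme4)
fit_fp_lmm <- lmer(log(fp) ~ lifetime * tense + (1 | px) + (1 | item),
                   data = df_crit_verb)
```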

performance package

  • the performance package has a nice function check_model() that produces 6 plots to check model fit
performance::check_model(fit_fp_log)

Figure 2: Performance of model with log reading times

performance::check_model(fit_fp)

Figure 3: Performance of model with raw reading times

Communicating your results

  • model summaries can be provided via tables and/or figures
    • you should always report the t-values and p-values of an effect
  • this is where the tidy() function from broom is handy in combination with knitr::kable(digits = x) and kableExtra::kable_styling()
tidy(fit_fp_log) %>% 
  kable(digits = 10,
        col.names = c("Coefficient",
                      "Estimate (log)",
                      "SE",
                      "t-value",
                      "p-value")) %>% 
  kable_styling()
Coefficient Estimate (log) SE t-value p-value
(Intercept) 5.63476711 0.01877047 300.1931843 0.000000000
lifetime1 0.10130415 0.03754094 2.6984981 0.007183743
tense1 -0.03357132 0.03754094 -0.8942589 0.371582493
lifetime1:tense1 -0.08320391 0.07508188 -1.1081757 0.268280198

P-value formatting

  • you can create a function that formats p-values
# source: https://stackoverflow.com/questions/37470202/in-line-code-for-reporting-p-values-001-in-r-markdown
# broman provides myround(), used below
pacman::p_load(broman) 
format_pval <- function(x){
  if (x < .001) return(paste('<', '.001'))
  if (x < .01) return(paste('<', '.01'))
  if (x < .05) return(paste('<', '.05'))
  paste('=', myround(x, 3))  # if above .05, print p-value to 3 decimal places
}
make_stars <- function(pval) {
  stars = ""
  if(pval <= 0.001)
    stars = "***"
  if(pval > 0.001 & pval <= 0.01)
    stars = "**"
  if(pval > 0.01 & pval <= 0.05)
    stars = "*"
  if(pval > 0.05 & pval <= 0.1)
     stars = "."
  stars
}

Formatted table

tidy(fit_fp_log) %>% 
  mutate(format = sapply(p.value, function(x) format_pval(x))) %>% 
  mutate(signif = sapply(p.value, function(x) make_stars(x))) %>%
  select(-p.value) %>% 
  kable(digits = 10,
        col.names = c("Coefficient",
                      "Estimate (log)",
                      "SE",
                      "t-value",
                      "p-value",
                      "sign")) %>% 
  kable_styling()
Coefficient Estimate (log) SE t-value p-value sign
(Intercept) 5.63476711 0.01877047 300.1931843 < .001 ***
lifetime1 0.10130415 0.03754094 2.6984981 < .01 **
tense1 -0.03357132 0.03754094 -0.8942589 = 0.372
lifetime1:tense1 -0.08320391 0.07508188 -1.1081757 = 0.268

Plots

df_crit_verb |> 
  filter(fp > 0) |> 
  mutate(log_ff = log(fp)) |> 
  mutate(half = if_else(trial >= 104, "2nd", "1st")) |> 
  ggpubr::ggqqplot(x = "log_ff")

sjPlot

Figure 4: ?(caption)

Distributions

Figure 5: ?(caption)

Errorbar

fig_error <-
  df_crit_verb %>% 
  filter(fp > 0) %>% 
  summarySEwithin(measurevar="fp", withinvars=c("lifetime", "tense"), idvar="px") %>% 
  mutate(upper = fp+ci,
         lower = fp-ci) %>% 
  ggplot(aes(x = lifetime, y = fp, colour = tense, shape = tense)) + 
  labs(title = "Mean first-pass reading times (with 95% CIs)") +
  geom_point(position = position_dodge(0.2), size = 2) +
  geom_line(position = position_dodge(0.2), aes(group=tense)) +
  geom_errorbar(aes(ymin=lower,ymax=upper), position = position_dodge(0.2), width = .2) +
  theme_bw() +
  theme(text = element_text(size=8))

fig_error_log <-
  df_crit_verb %>% 
  filter(fp > 0) %>% 
  mutate(fp_log = log(fp)) %>% 
  summarySEwithin(measurevar="fp_log", withinvars=c("lifetime", "tense"), idvar="px") %>% 
  mutate(upper = fp_log+ci,
         lower = fp_log-ci) %>% 
  ggplot(aes(x = lifetime, y = fp_log, colour = tense, shape = tense)) + 
  labs(title = "Mean first-pass reading times (with 95% CIs)") +
  geom_point(position = position_dodge(0.2), size = 2) +
  geom_line(position = position_dodge(0.2), aes(group=tense)) +
  geom_errorbar(aes(ymin=lower,ymax=upper), position = position_dodge(0.2), width = .2) +
  theme_bw() +
  theme(text = element_text(size=8))

Figure 6: ?(caption)

Summary

  • we looked at some assumptions of linear models and how to check them

  • we saw how to log-transform our data so that our residuals are normally distributed

  • we learned the importance of the independence assumption, which is violated by repeated-measures designs

  • next, we’ll learn how to account for lack of independence between the data with mixed models

References

Winter, B. (2019). Statistics for Linguists: An Introduction Using R. Routledge. https://doi.org/10.4324/9781315165547